Implementing MPI with Optimized Algorithms for Metacomputing
Authors
Abstract
This paper presents an implementation of the Message Passing Interface called PACX-MPI. The major goal of the library is to support heterogeneous metacomputing for MPI applications by clustering MPPs and PVPs. The key concept of the library is a daemon concept. In this paper we focus on two aspects of the library. First, we show the importance of using optimized algorithms for the global operations in such a metacomputing environment. Second, we discuss whether we introduce a bottleneck by using daemon nodes for the external communication.

Keywords: MPI, Metacomputing, Global Operations

I. Why another MPI Implementation?

In the last couple of years a large number of tools and libraries have been developed to enable the coupling of computational resources that may be distributed all over the world [5], [6], [8], [9], [12], [13]. The goal of such projects is usually to solve problems on a cluster of machines which cannot be solved on a single Massively Parallel Processing system (MPP) or Parallel Vector Processor (PVP). PACX-MPI (PArallel Computer eXtension) [10] is an implementation of MPI which tries to meet the demands of distributed computing. While most vendor-implemented libraries do not support interoperability between different MPI libraries, PACX-MPI makes the MPI calls available across different platforms. The Interoperable MPI approach (IMPI) [11] may solve this problem, but it is not yet a standard, and thus there are as yet no MPI implementations conforming to its specifications. Many of the features described in the IMPI draft document are, however, reflected in the PACX-MPI concept, although PACX-MPI is also far from providing the full functionality required there.

MPICH [7], the most widely used MPI distribution, supports a large number of platforms and also supports the coupling of machines. The major disadvantage of this implementation is that one may run into difficulties when coupling, e.g., two machines with 512 nodes each, because of the number of open ports: in the worst case one may end up with 511 open ports on each node. The number of open ports is furthermore important for all machines that are protected by some kind of firewall: the fewer ports one has to use for coupling different resources, the easier it is to open and control those few ports.

Some other approaches to achieve interoperability have been made. PVMPI [5] makes MPI applications run on a cluster of machines by using PVM for the communication between the different machines. Unfortunately, the user can only use point-to-point operations and has to add some calls that are not part of the MPI standard. The subsequent project, MPI Connect, uses the same ideas but replaced PVM by a library called SNIPE [6] and, in contrast to PVMPI, now supports global operations too. A similar approach has been taken by PLUS [2]. This library additionally supports communication between different message-passing libraries, such as PARMACS, PVM and MPI, but again the user has to add some calls to the application. Another project, called Stampi [12], has been presented recently. This project already uses the MPI-2 process model, but focuses mainly on local-area computing.

Drawing on the experience of many of those efforts, this paper presents the concept of PACX-MPI and the results achieved with it, especially with respect to optimized communication algorithms. The paper is organized as follows. The second section describes the main concepts and the main ideas of PACX-MPI.
In the third section we focus on optimizing global operations for metacomputing. In the fourth section we discuss whether we introduce a bottleneck by using daemon nodes for the external communication. In the fifth section we present some applications which used PACX-MPI during the Supercomputing '98 event in Orlando, together with some of the optimization efforts made there. In the last section we briefly describe the ongoing work and the future activities in this project.

II. Concept of PACX-MPI

Before we describe the concept of PACX-MPI, we have to define what kind of clusters we want to use it for. With PACX-MPI we do not intend to cluster workstations or even small MPPs or PVPs to simulate a big parallel machine. Our goal is to couple big resources in order to simulate machines which can hardly be built nowadays. This implies that these machines are usually not in the same computing center, and therefore we have to deal with latencies between the machines which are in a completely different range than the latencies inside a single machine.

To couple different MPPs and PVPs, PACX-MPI has to distinguish between internal and external operations. Internal operations are executed using the vendor-implemented MPI library, since these libraries are highly optimized. Furthermore, this is nowadays the only protocol which is available on each machine and which can exploit the full capabilities of the underlying network of an MPP. Therefore PACX-MPI can be described as an implementation of MPI on top of the native MPI libraries. External operations, e.g. point-to-point operations between two nodes on different machines, are handled by a different standard protocol. Currently PACX-MPI supports only TCP/IP, but we will add other protocols, such as native ATM, in the frame of a European project in the future. In this sense PACX-MPI can be described as a tool providing multi-protocol MPI for metacomputing.

To avoid that each node has to open ports whenever it wants to perform external operations, PACX-MPI uses two specialized daemon nodes for the external communication on each machine. Using these daemon nodes we can minimize the number of open ports and we can use fixed port numbers. These two nodes are transparent to the application and are therefore not part of global communicators such as MPI_COMM_WORLD. Figure 1 shows a configuration of two machines, each using 4 nodes for the application, and how MPI_COMM_WORLD looks in this example. On the left machine, which shall be machine number one, the first two nodes with local ranks 0 and 1 are not part of MPI_COMM_WORLD, since these are the daemon nodes. The next node, with local rank 2, is therefore the first node in our global communicator and gets the global rank 0. All other application nodes get a global rank according to their local rank minus two; the last node on this machine has the global rank 3. On the next machine, the daemon nodes again are not considered in the global MPI_COMM_WORLD. The node with the local rank 3 is number 4 in the global communicator, since the numbering on this machine starts with the last global rank of the previous machine plus one.
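To make the rank mapping concrete, the following C fragment is a minimal sketch of the rule just described: an application node's global rank is its local rank minus the two daemon nodes, plus the number of application nodes on all preceding machines. The function global_rank, the constant PACX_NUM_DAEMONS and the app_nodes table are names chosen here for illustration only; they are not part of the PACX-MPI interface.

    #include <stdio.h>

    #define PACX_NUM_DAEMONS 2   /* daemon nodes per machine, as described above */

    /* app_nodes[i] holds the number of application nodes on machine i
     * (a hypothetical configuration table for this sketch). */
    static int global_rank(int machine, int local_rank, const int *app_nodes)
    {
        if (local_rank < PACX_NUM_DAEMONS)
            return -1;                    /* daemon node: not part of MPI_COMM_WORLD */

        int offset = 0;                   /* application nodes on preceding machines */
        for (int m = 0; m < machine; m++)
            offset += app_nodes[m];

        return offset + local_rank - PACX_NUM_DAEMONS;
    }

    int main(void)
    {
        const int app_nodes[] = { 4, 4 }; /* configuration of Figure 1 */
        /* First machine: local rank 2 -> global rank 0, local rank 5 -> global rank 3 */
        printf("%d %d\n", global_rank(0, 2, app_nodes), global_rank(0, 5, app_nodes));
        return 0;
    }

For the configuration of Figure 1 this reproduces the numbering of the first machine's application nodes (global ranks 0 to 3); the daemon nodes are simply skipped, which is why they never appear in MPI_COMM_WORLD.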
Similar Resources
An Extension to MPI for Distributed Computing on MPPs
We present a tool that allows an MPI application to run on several MPPs without having to change the application code. PACX (PArallel Computer eXtension) provides the user with a distributed MPI environment offering most of the important functionality of standard MPI. It is therefore well suited for use in metacomputing. We are going to show how two MPPs are configured by PACX into a single virtual ...
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer
BlueGene/L is a massively parallel supercomputer that is currently the fastest in the world. Implementing MPI, and especially fast collective communication operations, can be challenging on such an architecture. In this paper, I will present optimized implementations of MPI collective algorithms on the BlueGene/L supercomputer and show performance results compared to the default MPICH2 algorithm...
FT-MPI, Fault-Tolerant Metacomputing and Generic Name Services: A Case Study
There is a growing interest in deploying MPI over very large numbers of heterogeneous, geographically distributed resources. FT-MPI provides the fault tolerance necessary at this scale, but presents some issues when crossing multiple administrative domains. Using the H2O metacomputing framework, we add cross-administrative-domain interoperability and pluggability to FT-MPI. The latter feature al...
Matrix Multiplication on Heterogeneous Platforms
In this paper, we address the issue of implementing matrix multiplication on heterogeneous platforms. We target two different classes of heterogeneous computing resources: heterogeneous networks of workstations and collections of heterogeneous clusters. Intuitively, the problem is to load balance the work across resources of different speeds while minimizing the communication volume. We formally sta...
Wide-Area Implementation of the Message Passing Interface
The Message Passing Interface (MPI) can be used as a portable, high-performance programming model for wide-area computing systems. The wide-area environment introduces challenging problems for the MPI implementor, due to the heterogeneity of both the underlying physical infrastructure and the software environment at different sites. In this article, we describe an MPI implementation that incorpo...
Journal title:
Volume, Issue:
Pages: -
Publication date: 1999